414333 - Özgür Polat
417121 - Hüseyin Can Minareci

Introduction

With more and more consumers abandoning their credit card programs, a manager at the bank is concerned. They would really appreciate if one could foresee who would be churned for them so that they can proactively go to the consumer and offer better value to them and turn the decisions of customers in the opposite direction. Thus, they aggregated this dataset and the source we acquired it received it from https://leaps.analyttica.com/

Here in this project we will try to enlight the big picture a bit more with the capabilities we gained thanks to the Advanced Visualization in R course in Faculty of Economical Sciences at the University of Warsaw.

Data Definition

It is the best to start with understanding the variables we have and their definitions.

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then 1 else 0
  • Customer_Age: Demographic variable - Customer’s Age in Years
  • Gender: Demographic variable - M=Male, F=Female
  • Dependent_count: Demographic variable - Number of dependents
  • Education_Level: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
  • Marital_Status: Demographic variable - Married, Single, Divorced, Unknown
  • Income_Category: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, > $120K, Unknown)
  • Card_Category: Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
  • Months_on_book: Period of relationship with bank
  • Total_Relationship_Count:Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal:Total Revolving Balance on the Credit Card
  • Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
  • Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • Avg_Utilization_Ratio: Average Card Utilization Ratio
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1: Naive Bayes
  • Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2: Naive Bayes

Library Imports

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(cowplot)
library(ggforce)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggpubr)
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:cowplot':
## 
##     get_legend

Data Import

churn <- read.csv("Data/churn.csv", sep = ',')
churnN <- read.csv("Data/churn.csv", sep = ',', na.strings = c("NA", "N/A", "Unknown"))

Data Preparation

colSums(churn[,c(6,7,8)]=="Unknown")
## Education_Level  Marital_Status Income_Category 
##            1519             749            1112
## I think we should remove those Unknowns
churnNN <- drop_na(churnN)
prop.table(table(churn$Attrition_Flag))
## 
## Attrited Customer Existing Customer 
##         0.1606596         0.8393404
prop.table(table(churnNN$Attrition_Flag))
## 
## Attrited Customer Existing Customer 
##         0.1571812         0.8428188

After dropping Unknowns we are having very similar distribution and I would say lets drop it in order to have better EDA

Exploratory Data Analysis

Plot 1

ggplot(churnNN, aes(x = Credit_Limit, fill = Education_Level)) + 
  geom_histogram(data = churn[,-6], fill = "grey", alpha = .5) + 
  geom_histogram(colour = "black") +
  facet_wrap(~ Education_Level) + 
  guides(fill = FALSE)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

unique(churnNN$Education_Level)
## [1] "High School"   "Graduate"      "Uneducated"    "College"      
## [5] "Post-Graduate" "Doctorate"

Plot 2

my_comp <- list( c("Uneducated", "High School"), c("High School", "College"), c("College", "Graduate"), c("Graduate", "Post-Graduate"), c("Post-Graduate", "Doctorate") )

ggviolin(churnNN, x = "Education_Level", y = "Total_Revolving_Bal",
          fill = "Education_Level", palette = "jco",
          add = "boxplot", add.params = list(fill = "white"))  + 
  stat_compare_means(method = 'anova') +
  stat_compare_means(comparisons = my_comp)

Plot 3

Correlation plot

library(heatmaply)
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Loading required package: viridisLite
## 
## ======================
## Welcome to heatmaply version 1.1.1
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
library(plotly)
library(ggcorrplot)
churn_numeric <- select_if(churnNN, is.numeric)

churn_ready_for_corr <- churn_numeric %>% 
  select(1:15)


# Compute correlation coefficients
corr <- churn_ready_for_corr %>% 
  cor()


# Compute correlation p-values
cor.test.p <- function(x){
    FUN <- function(x, y) cor.test(x, y)[["p.value"]]
    z <- outer(
      colnames(x), 
      colnames(x), 
      Vectorize(function(i,j) FUN(x[,i], x[,j]))
    )
    dimnames(z) <- list(colnames(x), colnames(x))
    z
}
p <- cor.test.p(churn_ready_for_corr)

# Create the heatmap
heatmaply_cor(
  corr,
  node_type = "scatter",
  point_size_mat = -log10(p), 
  point_size_name = "-log10(p-value)",
  label_names = c("x", "y", "Correlation")
)

Conclusion